If you’ve ever searched online for a recipe or asked “what should I make for dinner” in a search engine, one of the results is most likely from allrecipes.com.
Allrecipes.com is a recipe-sharing platform with over 100,000 recipes and 60 million users globally. Users can submit their own recipes as well as interact with other users by commenting on, reviewing, or rating recipes. Allrecipes is unique in that it is a public forum for recipes and a community for anyone who wants to cook, rather than a carefully curated blog. You can find family heirlooms like Jewish Grandma’s Best Beef Brisket or simple recipes like these basic crepes (which I have tried and can attest to!).
In addition to being a great resource when you’re in a cooking rut, Allrecipes has a trove of data on each page. So much so that Brian Mubia took notice and decided to create the tastyR package.
The tastyR package contains two datasets, allrecipes and cuisines. For this project, we’ll dive into cuisines - a dataset containing over 2,000 recipes from allrecipes with information on ingredients, cuisine, nutrition, reviews, and ratings.
After reviewing the cuisines dataset, I was most interested in two components: the cuisine variable, which closely parallels country of origin, and the ingredients variable. We often group countries’ cuisines into broader categories by geographic location. For example, the foods of Italy, Greece, and Turkey are commonly considered Mediterranean. I was curious to see whether that intuition holds up based on ingredient usage, or whether geographically distant cuisines have more in common than we might have assumed.
To help guide the project, I came up with some questions to try and answer.
To answer these questions, we will:
Before diving into the data analysis, I want to go over the contents of the dataset, data cleaning steps and how ingredients were tokenized.
| variable | type | label |
|---|---|---|
| country | character | Cuisine |
| name | character | Name of Recipe |
| url | character | URL |
| author | character | Author |
| date_published | Date | Date Published or Last Updated |
| ingredients | character | List of Ingredients |
| calories | integer | Calories per Serving |
| fat | integer | Fat per Serving |
| carbs | integer | Carbs per Serving |
| protein | integer | Protein per Serving |
| avg_rating | numeric | Average Ratings |
| total_ratings | integer | Total Number of Ratings |
| reviews | integer | Total Number of Reviews |
| prep_time | integer | Prep Time (in minutes) |
| cook_time | integer | Cook Time (in minutes) |
| total_time | integer | Total Time (in minutes) |
| servings | integer | Number of Servings |
The cuisines dataset contains 2,218 records with 17 different variables, described above. Records are uniquely identified by name and author.
Ingredients are comma-delimited with measurement and units but not standardized. Fat, carbs, and protein are measured in grams. Ratings are on a 1-star to 5-star rating scale.
Note that total ratings and reviews are erroneously truncated to the thousands unless a recipe had fewer than 1,000 ratings in total. This was discovered when creating frequency tables and confirmed online (GitHub - TidyTuesday Data).
In the exploratory data analysis, we will cover some basic descriptive statistics on some of these variables to give us a general idea of the recipes we are working with.
For data cleaning, the following steps were taken:
Near-duplicate records were flagged using stringdistmatrix, hclust, and cutree. The threshold for cutree was 0.05, and duplicates were manually reviewed to ensure that they were similar enough to be considered the same and that the threshold was appropriate.
Outliers in numeric variables were not examined, as these variables will not be used in the main analysis. After data cleaning, 9 records were removed from the initial dataset.
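A minimal sketch of how such a near-duplicate check might look is shown below. The source only names the three functions and the 0.05 cut height, so the rest is assumed: the distance metric (`method = "jw"`), the default complete linkage, the choice to compare recipe names, and the toy names themselves.

```r
library(stringdist)  # provides stringdistmatrix()

# Hypothetical recipe names; which column was actually compared is an assumption.
recipe_names <- c(
  "Jewish Grandma's Best Beef Brisket",
  "Jewish Grandmas Best Beef Brisket",
  "Chicken Adobo",
  "Basic Crepes"
)

# Pairwise string distances scaled to [0, 1] (0 = identical).
# method = "jw" is an assumption; the source does not name the metric.
d <- stringdistmatrix(recipe_names, method = "jw")

# Hierarchical clustering, then cut the tree at height 0.05 as in the text.
hc <- hclust(d)
cluster_id <- cutree(hc, h = 0.05)

# Names that share a cluster id are candidate duplicates for manual review.
candidates <- split(recipe_names, cluster_id)
candidates <- candidates[lengths(candidates) > 1]
candidates
```

A cut height this small only makes sense with a distance normalized to the 0 to 1 range, which is why a normalized metric is assumed here. The manual review step described above still matters, since two distinct recipes can have nearly identical names.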
Below is an example of what the raw ingredients variable looks like.
## ingredients
## 1 1 pound sliced bacon, diced, 1 medium sweet onion, chopped, 9 large eggs, lightly beaten, 4 cups frozen shredded hash brown potatoes, thawed, 2 cups shredded Cheddar cheese, 1 ½ cups small curd cottage cheese, 1 ¼ cups shredded Swiss cheese
## 2 3 egg yolks, 1 tablespoon lemon juice, ¼ teaspoon Dijon mustard, 1 dash hot pepper sauce (e.g. Tabasco™), ½ cup butter
## 3 oil for deep frying, 1 cup unbleached all-purpose flour, 2 teaspoons salt, ½ teaspoon ground black pepper, ½ teaspoon cayenne pepper, ½ teaspoon paprika, ¼ teaspoon garlic powder, 1 large egg, 1 cup milk, 3 skinless, boneless chicken breasts, cut into 1/2-inch strips, ¼ cup hot pepper sauce, 1 tablespoon butter
## 4 1 orange, 1 lemon, 1 lime, 1 (750 milliliter) bottle dry red wine, 1 ½ cups rum, 1 cup orange juice, ½ cup white sugar, or to taste
## 5 4 skinless, boneless chicken breast halves - pounded to ½-inch thickness, salt and pepper to taste, 2 tablespoons all-purpose flour, 1 egg, beaten, 1 cup panko bread crumbs, 1 cup oil for frying, or as needed
As you can see, it contains a lot of information, represented in different forms. For example, there is additional text within parentheses as well as measurements and methods of preparation (e.g., “chopped”).
For the purposes of PCA, we will standardize and tokenize the ingredients so that we get one row per ingredient per recipe, with no measurement, unit, or additional information. Unnecessary adjectives such as small and large will be removed.
The following steps were taken:
Misspelled ingredient names were clustered using stringdistmatrix, hclust, and cutree, as above. The threshold for cutree was 0.10, and clusters were manually reviewed to ensure that they were true misspellings.
Standardizing the ingredients was not a trivial effort. Given the number of recipes and ingredients, it is not guaranteed that every case was accounted for in this step.
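To make the idea of tokenizing concrete, here is a rough base R sketch under stated assumptions. The function name `tokenize_ingredients` and the short `units` and `descriptors` lists are hypothetical stand-ins. This is not the actual cleaning code, and it will not reproduce the final tokens exactly (for instance, it does not singularize words).

```r
# Hypothetical unit and descriptor lists; the real lists are not shown in the post.
units <- c("pound", "pounds", "cup", "cups", "tablespoon", "tablespoons",
           "teaspoon", "teaspoons", "dash")
descriptors <- c("small", "medium", "large", "sliced", "shredded", "frozen")

tokenize_ingredients <- function(x) {
  x <- tolower(x)
  x <- gsub("\\([^)]*\\)", "", x)              # drop parenthetical notes
  items <- strsplit(x, ",\\s*")[[1]]           # one element per comma-separated piece
  items <- sub("^[^a-z]+", "", items)          # strip leading quantities such as "1/2 "
  drop <- paste0("\\b(", paste(c(units, descriptors), collapse = "|"), ")\\b")
  items <- gsub(drop, "", items, perl = TRUE)  # remove units and descriptors
  items <- gsub("\\s+", " ", trimws(items))    # tidy leftover whitespace
  items[nzchar(items)]                         # discard pieces that ended up empty
}

# ASCII version of one raw record from the example above.
raw <- "3 egg yolks, 1 tablespoon lemon juice, 1/4 teaspoon Dijon mustard, 1 dash hot pepper sauce (e.g. Tabasco), 1/2 cup butter"
tokenize_ingredients(raw)
```

Splitting on commas is the weak point of this sketch: an entry like "3 skinless, boneless chicken breasts, cut into 1/2-inch strips" breaks into several pieces, which is one reason the real standardization needed manual review.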
After these steps, this is what the ingredients column became:
## food
## 1 bacon
## 2 onion
## 3 egg
## 4 potato
## 5 cheddar cheese
## 6 cottage cheese
## 7 swiss cheese
## 8 yolk
## 9 lemon juice
## 10 dijon mustard
## 11 pepper sauce
## 12 butter
## 13 flour
## 14 salt
## 15 pepper
## 16 cayenne pepper
## 17 paprika
## 18 garlic powder
## 19 egg
## 20 milk
In the data analysis section, we will discuss the top ingredients.
To get a sense of the data we are working with, I produced some basic graphs and a table with descriptive statistics on most of the variables.
We can see above that we are dealing with recipes mostly from the last 5 years, so they should reflect current food trends.
The above graphs display the spread of the nutritional variables. With half of the recipes containing 11 grams of protein or less per serving, the dataset may lean toward vegetarian rather than meat-based recipes. The median is about 320 calories per serving, so most recipes are likely moderate rather than indulgent.
The table below includes a more numerical look at the raw variables.
| variable | level | statistics |
|---|---|---|
| Number of Records | NA | 2209 |
| date_published | [2005,2010) | 1 (0.05%) |
| date_published | [2010,2015) | 10 (0.45%) |
| date_published | [2015,2020) | 48 (2.17%) |
| date_published | [2020,2025] | 2150 (97.33%) |
| calories | mean(sd) | 358.16 (239.27) |
| calories | median(q1,q3) | 319.5 (190, 477) |
| fat | mean(sd) | 18.76 (16.96) |
| fat | median(q1,q3) | 15 (7, 26) |
| carbs | mean(sd) | 31.87 (25.87) |
| carbs | median(q1,q3) | 26 (13, 45) |
| protein | mean(sd) | 16.61 (16.3) |
| protein | median(q1,q3) | 11 (4, 25) |
| avg_rating | mean(sd) | 4.51 (0.4) |
| avg_rating | median(q1,q3) | 4.6 (4.3, 4.8) |
| reviews | mean(sd) | 77.06 (142.25) |
| reviews | median(q1,q3) | 21 (6, 74) |
| prep_time | mean(sd) | 21.53 (60.84) |
| prep_time | median(q1,q3) | 15 (10, 25) |
| cook_time | mean(sd) | 41.8 (63.23) |
| cook_time | median(q1,q3) | 25 (10, 45) |
| total_time | mean(sd) | 171 (642.8) |
| total_time | median(q1,q3) | 60 (35, 120) |
| servings | mean(sd) | 10.47 (13.44) |
| servings | median(q1,q3) | 8 (4, 12) |
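For readers who want to build a similar table, the two summary formats used above can be sketched in base R as follows. The helper name `summarise_numeric` and the toy calorie values are hypothetical. The quartiles use R's default `quantile()` method, which may or may not match the method behind the table.

```r
# Format a numeric vector as "mean (sd)" and "median (q1, q3)" strings.
summarise_numeric <- function(x) {
  x <- x[!is.na(x)]  # ignore missing values
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  c(
    "mean(sd)"      = sprintf("%.2f (%.2f)", mean(x), sd(x)),
    "median(q1,q3)" = sprintf("%g (%g, %g)", q[2], q[1], q[3])
  )
}

# Toy values, not taken from the dataset.
calories <- c(190, 250, 320, 477, 600)
summarise_numeric(calories)
```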
Next, we will look at the two critical variables for this analysis - cuisine and ingredients. Below are graphs detailing the proportion of recipes by cuisine and the most common ingredients.
There are over 40 different cuisines, most of which are directly related to a single country, with the exception of a few such as Jewish, Cajun and Creole, Amish and Mennonite, and Southern Recipes. The largest percentages of recipes come from Brazilian and Filipino cuisine, while the lowest comes from Belgian. Several cuisines are missing from this dataset, which is a limitation of this analysis. For example, we do not have any data on countries in Africa besides South Africa.
The top 25 ingredients are not very surprising and make intuitive sense. I am a little surprised that onion ranks ahead of salt and water. For the rest of the graphs and the PCA, I will remove the top 10 ingredients, as I don’t think they are very informative in defining a country’s ingredient profile.
To answer which ingredients are most common in each cuisine, the following graphs were created. They allow us to step through all the cuisines and take note of which ingredients appear the most. I organized them by UN geoscheme region, and when a cuisine was not a country, I grouped it with the region it is typically associated with. The only exception was Jewish cuisine, since it spans multiple regions and is dispersed globally. We will use this grouping to color the individual cuisines when we graph their PC1 and PC2 scores.
To get a look at the top 50 ingredients and all the cuisines at once, I created this heatmap. This allows us to identify which cuisines have similar proportions for each of the top 50 ingredients. The y-axis is grouped by region and the x-axis is ordered from most common to least common.
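Dropping the most common ingredients can be sketched in a few lines of base R. The `tokens` table below is a hypothetical stand-in for the one-row-per-ingredient-per-recipe data, and `n_drop` is set to 2 only to keep the toy example readable (the analysis itself drops the top 10).

```r
# Hypothetical long-format table: one row per ingredient per recipe.
tokens <- data.frame(
  recipe = c(1, 1, 1, 2, 2, 3, 3, 3),
  food   = c("onion", "salt", "ginger", "onion", "salt", "onion", "tofu", "salt"),
  stringsAsFactors = FALSE
)

# Count how often each ingredient appears, most frequent first.
counts <- sort(table(tokens$food), decreasing = TRUE)

# Identify the most common ingredients and remove their rows.
n_drop <- 2
top_foods <- names(counts)[seq_len(n_drop)]
filtered <- tokens[!tokens$food %in% top_foods, ]
filtered
```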
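One plausible way to compute the proportions behind such a heatmap is sketched below. It assumes that "proportion" means the share of a cuisine's recipes that use a given ingredient. The post does not spell out the definition, so treat this as an illustration rather than the exact calculation. The `tokens` data is made up.

```r
# Hypothetical tokenized data: one row per ingredient per recipe.
tokens <- data.frame(
  cuisine = c("Thai", "Thai", "Thai", "Thai", "Italian", "Italian", "Italian"),
  recipe  = c(1, 1, 2, 2, 3, 3, 4),
  food    = c("ginger", "cilantro", "ginger", "tofu", "basil", "tomato", "basil"),
  stringsAsFactors = FALSE
)

# Number of distinct recipes in each cuisine.
n_recipes <- tapply(tokens$recipe, tokens$cuisine, function(r) length(unique(r)))

# Count each ingredient at most once per recipe, then tabulate by cuisine.
pairs <- unique(tokens[, c("cuisine", "recipe", "food")])
counts <- table(pairs$cuisine, pairs$food)

# Divide each cuisine's row by its recipe count to get proportions.
prop <- sweep(counts, 1, as.vector(n_recipes[rownames(counts)]), "/")
prop
```

A matrix like `prop`, with cuisines as rows and ingredients as columns, is also the natural input shape for a PCA on ingredient usage.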
Applied prcomp() to the ingredient proportion matrix with centering and scaling.
[Interactive plot: results/plots/tsne_recipes.html]
## ingredient PC1
## ginger root ginger root 0.1531790
## cilantro cilantro 0.1290366
## sprout sprout 0.1264313
## chicken thigh chicken thigh 0.1233345
## star anise star anise 0.1231000
## ginger ginger 0.1183448
## rice noodle rice noodle 0.1140547
## cilantro leaf cilantro leaf 0.1120065
## coriander coriander 0.1070627
## peanut peanut 0.1064545
## ingredient PC2
## rice vinegar rice vinegar 0.1764492
## ketchup ketchup 0.1684169
## ginger ginger 0.1598751
## rice wine vinegar rice wine vinegar 0.1410089
## sprout sprout 0.1402169
## chile paste chile paste 0.1383847
## ginger root ginger root 0.1297077
## tofu tofu 0.1291296
## chicken thigh chicken thigh 0.1254302
## honey honey 0.1225864
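The PCA step and loading tables like the ones above could be produced along these lines. This is a sketch on a small random matrix, not the real data. The object names (`prop`, `pca`, `loadings`, `top_pc1`) and the toy dimensions are assumptions, and only the call to `prcomp()` with centering and scaling comes from the analysis itself.

```r
set.seed(1)

# Hypothetical cuisine-by-ingredient proportion matrix (6 cuisines, 4 ingredients).
prop <- matrix(
  runif(6 * 4), nrow = 6,
  dimnames = list(paste0("cuisine_", 1:6),
                  c("ginger", "cilantro", "basil", "tomato"))
)

# PCA with centering and scaling, as described in the text.
pca <- prcomp(prop, center = TRUE, scale. = TRUE)

# Scores on the first two components, one row per cuisine (used for plotting).
scores <- pca$x[, 1:2]

# Ingredient loadings on PC1, sorted from largest to smallest.
loadings <- data.frame(
  ingredient = rownames(pca$rotation),
  PC1 = pca$rotation[, "PC1"]
)
top_pc1 <- head(loadings[order(loadings$PC1, decreasing = TRUE), ], 10)
top_pc1
```

Keep in mind that the sign of a principal component is arbitrary, so the ingredients with the largest negative loadings are just as informative as those with the largest positive ones.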